Numerical Replication of Computer Simulations: Some Pitfalls and 

How To Avoid Them * 
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Abstract 

A computer simulation, such as a genetic algo- 
rithm, that uses IEEE standard floating-point 
arithmetic may not produce exactly the same 
results in two different runs, even if it is rerun 
on the same computer with the same input and 
random number seeds. Researchers should not 
simply assume that the results from one run 
replicate those from another but should verify 
this by actually comparing the data. However, 
researchers who are aware of this pitfall can re- 
liably replicate simulations, in practice. This 
paper discusses the problem and suggests so- 
lutions. 



1 INTRODUCTION 

When we perform computer simulations, such as ge- 
netic algorithms, it is often useful to be able to replicate 
a run exactly, so that those results of the second run 
that we care about are exactly the same as those of the 
first. This kind of replication is called numerical repli- 
cation [Q. For instance, if we notice a strange result in a 
run, it is useful to be able to redo the run exactly, using 
the same parameter settings and random number seeds, 
but this time collecting additional data or perturbing 
the course of the run in order to test hypotheses about 
what is causing the strange results. Numerical replica- 
tion can also be used to verify that experimental results 
are not due to a bug or human error, and to perform 
regression testing after making changes to a program. 

However, a program that uses IEEE standard floating- 
point arithmetic [p^ ^ may produce different results 
on two different computers, even if the same input and 
random number seeds are used. In fact, there is no 
guarantee that it will produce the same results when 
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run twice on the same computer, or even that a subex- 
pression will have the same value when evaluated at two 
different points during a single run. This is because the 
calculations may be performed at different precisions 
each time, and the programmer has little control over 
what precision is used pl| ]. This can cause numerical 
replication to fail unexpectedly. In the worst case, this 
can lead us to believe that two different sets of results 
are the same, and thereby cause us to draw incorrect 
conclusions. 

Luckily, if we are aware of these pitfalls, we can reli- 
ably avoid them in practice. We should not simply as- 
sume that two sets of data are the same because we used 
the same input and random number seeds; instead, we 
should always verify this empirically. Furthermore, we 
should always record the computer platform and run- 
time and compile-time parameters that we use along 
with the simulation data. This will make numerical 
replication easier to achieve. We need tools to make 
both of these tasks easy and automatic. Finally, we need 
to compile a knowledgebase of heuristics for achieving 
numerical replication. In the remainder of this paper, 
I discuss the problem further and justify these recom- 
mendations. 



2 THE PROBLEM 

Computers perform arithmetic mainly on two kinds 
of numbers: integers (such as 42) and real numbers 
(such as 3.14159). There are various possible ways 
to represent real numbers in a computer; almost all 
modern computers use binary floating-point representa- 
tions H^, |l^, |l^. This representation is essentially 
the same as scientific notation, except in binary; be- 
sides representing real numbers, it can also be used to 
represent very large integers. Floating-point numbers 
are represented in the form (—1)'' x 1./ x 2^. The part 
s is the sign bit (0 or 1), 1./ is called the significand 
(an older term is mantissa), f is called the fraction, 
where < / < 1, and x is called the exponent. Al- 



#include <stdio.h> 

int mainO { 
double q; 

q = 3.0/7.0; 

if (q == 3.0/7.0) printf ("EqualXn") ; 
else printf ("Not EqualXn"); 
return 0; 

> 

Figure 1: Example program that may produce different 
results on different computer platforms, from Priest |^ 

#include <fpu_control .h> 

void attribute ((constructor)) 

enter_f pu_double_mode () { 

(void) __setfpucw ( (_FPU_DEFAULT & 
~_FPU_EXTENDED) I _FPU_DDUBLE) ; 

} 

Figure 2: Simply compile and link this C code with 
a program using the gcc compiler on x86 or m68k 
platforms to put the FPU in double-precision mode. 
Adapted from the g77 manual |Q. Other compilers 
should provide a similar way to do this. On x86, more 
needs to be done to completely emulate double preci- 
sion; see the text for details. 

most all computer platforms used today use IEEE 754 
standard ^ floating-point arithmetic. The only im- 
portant exceptions are the Cray X-MP, Y-MP, C90, and 
J90, the IBM /370 and 3090, and the DEC VAX; most 
of these are disappearing rapidly |Q. This standard 
has caused floating-point arithmetic to be much more 
reliable, predictable, and portable. 

However, the standard does not guarantee that a pro- 
gram will produce the same results when run on two dif- 
ferent computers ||2l| . This is because different comput- 
ers may perform floating point calculations differently, 
even if the computers all follow the IEEE standard. The 
standard specifies three different precisions for floating- 
point arithmetic: single precision (32 bits long), double 
precision (64 bits) and double extended precision (also 
called simply extended precision, 80 or more bits). Dif- 
ferent computer platforms support these precisions to 
different extents. 

We may get different results, for instance, when we run 
a simulation once on a Hewlett-Packard (HP) PA-RISC 
workstation running HPUX and once on an Intel x86 PC 
running Linux or MS Windows. The HP workstation 
uses single-precision and double-precision floating point 
arithmetic, while the x86 uses IEEE 80-bit extended- 



precision floating-point arithmetic by default. (The Mo- 
torola 680x0 (m68k) is another CPU family that uses 
80-bit extended precision; it was used in the first Mac- 
intosh computers.) Figure |l| shows an example program 
that may produce a different result on each platform, de- 
pending on the compiler and the compile-time settings. 
An HP workstation will print "Equal", while an x86 
computer may print either "Equal" or "Not Equal" , de- 
pending on the compiler and compile-time options that 
are used 

If the results of a simulation depend on many floating- 
point calculations, this difference in precision may cause 
the two runs to produce wildly different results. This is 
particularly likely in simulations of complex systems, 
such as a genetic algorithm, where the simulation's pre- 
cise trajectory is highly sensitive to the initial condi- 
tions and to the stream of random numbers. Even if the 
different runs produce the same qualitative results, the 
numeric results may differ. 

This may occur with any program that uses native IEEE 
floating-point arithmetic, written in any language, on 
any computer or operating system. Discrepancies may 
also occur in integer arithmetic, but only if a program 
makes unwarranted assumptions about the size or rep- 
resentation of integer variables (for example, assuming 
that C variables of type int are 32 bits long) . 

Both the x86's and the m68k's floating-point unit (FPU) 
can be switched into "single-precision" or "double- 
precision" mode (see Figure ^). This solves this par- 
ticular problem on the m68k. Unfortunately, even when 
the x86 is in one of these modes, it will still produce dif- 
ferent results than an HP or similar workstation would, 
since its internal registers will still use more bits of pre- 
cision for the exponents (15 bits instead of 8 or 11) |]l^ . 
To reduce the exponent range to be the same as that 
in "pure" single or double precision, the result must be 
stored to memory from the x86 FPU's internal regis- 
ters, and then reloaded from memory into the FPU. 
This will cause the computation to be two to four times 
slower than native floating-point arithmetic. If the gcc 
compiler is being used, this can be accomplished by us- 
ing the -f float-store compiler option. However, there 
may still be a discrepancy on the x86 in the last bit of 
about 10"'^^'* because of double rounding, if the floating- 
point operation is a multiplication or division. To avoid 
this, one of the operands must be scaled down before the 
operation by 2^"="'<!xtondod "^"■^^doubio , where a;maxextcnded 
the maximum possible exponent for extended precision 
and a^maxdoubie maximum possible exponent for 

double precision, and the result must be scaled back up 
by the same amount afterwards. This additional scaling 
adds only marginally to the computation time |l2j. 

By using this technique, replication problems can be 



made much less likely, at the expense of computation 
speed. However, it will not guarantee that such prob- 
lems will not occur. In fact, a program may produce 
different results when run twice on the same computer, 
even if the same input and random number seeds are 
used. This is because the results produced by a pro- 
gram depend not only on the computer's floating-point 
unit and operating system but also on the compiler, 
the compile-time options, the compile-time and run- 
time libraries installed, and the input (here I include the 
date, the run-time environment, and the random num- 
ber seeds). For instance, the discrepancy may occur if 
we run a simulation twice on a x86 computer, where the 
simulation is compiled the first time to store floating- 
point results to memory, and the second time to keep 
the results in the FPU's internal registers. Also, the 
libraries of mathematical functions such as log and sin 
may produce different results on different platforms 
and may also differ from version to version on the same 
platform. (The IEEE standard only contains specifica- 
tions for the square root function y^.) The IEEE stan- 
dard also does not completely specify the accuracy con- 
version between binary and decimal representations. It 
is even technically possible that the results may depend 
on what other programs are running on the computer, 
or on bugs in the program, compiler, or libraries — this 
is especially true if the program is not carefully designed 
and implemented. Therefore, each time a simulation is 
run, it is prudent to act as if it were run on a different 
computer, even if the computer is in fact always exactly 
the same. 

A related issue is that if a floating-point expression oc- 
curs more than once in different locations in a program, 
it may be evaluated to different precisions each time it 
is used during a single run For example, on an x86 
computer the compiler may choose whether to keep a re- 
sult in extended precision in the FPU or store it in dou- 
ble precision to memory based on the optimization level, 
the number of free floating-point registers, whether the 
result will be used as the argument to a function, and 
many other factors. (The forthcoming C99 ANSI/ISO C 
standard will guarantee that if the expression is stored 
in a variable, the same precision will be used whenever 
the variable's value is evaluated.) Besides complicating 
numerical replication, this may cause problems if the 
program assumes that the expression always evaluates 
to the same value during the course of a run. The f cmp 
package Q implements Knuth's Q suggestions for safer 
floating-point comparisons, which can be used to avoid 
this. 

Finally, some CPUs, such as the PowerPC, provide an 
operation called fused multiply- add that can perform the 
operation ±ax±b in a single instruction. If this instruc- 
tion is used, a different result may be produced than if 



it is not used, since there is one less rounding step [ |21| . 
Also, in expressions such as ±ax ± by it is ambigu- 
ous which side is evaluated first (and hence rounded). 
Therefore, this instruction must not be used in certain 
algorithms, for instance when multiplying a complex 
number by its conjugate [|l5[ Unfortunately, many 
compilers make it difficult for the programmer to specify 
whether this instruction should be allowed or inhibited 
in a program. 

Guaranteeing that two runs of a program will pro- 
duce exactly the same results is extremely difficult and 
may be impossible in practice. Every component which 
might affect the results would have to be guaranteed to 
be the same for both runs; none of these components 
could ever be changed or upgraded unless the new ver- 
sion could be shown to have no effect on the results. On 
the one hand, determining the version of every compo- 
nent on a computer and recording all of this information 
with the simulation data would be extremely expensive 
in time and storage. On the other hand, it will be ex- 
tremely difficult to weed out false positive results when 
testing whether two computers have different compo- 
nents: The fact that one of two otherwise identical com- 
puters has a copy of the game Quake installed probably 
will not affect whether a simulation will produce identi- 
cal results on the two machines, but it will be difficult 
to prove this. Finally, the date is always changing, and 
this might have unforeseen effects on a program's be- 
havior. (Consider the recent Y2k problem, or the bug 
that depended on the phase of the moon ||2^.) 

3 RECOMMENDATIONS 

If guaranteeing that we can numerically replicate a run 
is not an option, what can we do? I suggest that instead 
of asking how we can guarantee replication, we should 
ask two different questions: First, what is the worst-case 
result that can occur because of this problem, and how 
can we avoid it? Secondly, how can we make numerical 
replication easier to achieve and more reliable? 

The worst thing that can happen when we try to numer- 
ically replicate a run is that we mistakenly believe that 
the replicated results are exactly the same as the origi- 
nal, when they are in fact different. Our main concern 
should be to avoid this mistake. Luckily, there is an easy 
way to avoid it: Simply compare the data sets. If they 
are empirically identical, we are done. (Of course, if we 
do not record enough data from each run, it is possible 
that the runs' actual trajectories may be different, even 
though the data are the same.) 

Therefore, we need a set of easy-to-use tools to compare 
results from two runs, and we should use these tools even 
if the runs were done on the same computer, as a sanity 
check. In some cases, where entire files need to match 



exactly, a utility such as the Unix dif f command may 
suffice. In other cases, I suggest using Rivest's [|4| MD5 
message digest algorithm. This algorithm produces a 
short string (called a hash) that is easy to store with 
the data that it is computed from. Instead of compar- 
ing entire files, only the hash string from each file needs 
to be compared. If the data files clearly mark comments 
and other data that we do not need to replicate, such 
as the date of the run, then it is easy to write a short 
Perl program to compute an MD5 hash string 

from a data file, ignoring such extraneous information. 
(One common convention for marking comments in text 
files is to put a pound sign '#' at the beginning of each 
comment line.) If it is necessary to ensure that a dataset 
has not been tampered with in any way, there are cryp- 
tographically secure methods, such as signing the data 
set with PGP (28[ § or GnuPG (l9|, or using another 



message digest algorithm, such as RIPEMD-160 IJ. 
(MD5 should not be used for this purpose pst!) 

Often, using the same input and random number seed 
will be all that is necessary to numerically replicate 
a run. Sometimes, however, this will not suffice. In 
this case, we can almost always replicate the results by 
tweaking a few special compile-time or run-time param- 
eters (such as what precision the FPU uses). Experi- 
ence suggests that numerical replication is usually easy 
to achieve in practice, even though it may be impossi- 
ble to guarantee. In some cases, it may be necessary 
to rerun the simulation on the same computer platform 
that was used originally. For instance, if a simulation 
is run on an x86 platform using extended precision, it 
will be difficult to numerically replicate the results on 
any platform other than an x86 or m68k using extended 
precision. In addition to the techniques for comparing 
results mentioned previously, we need a set of heuristics 
for numerical replication, such as a list of compile-time 
and run-time parameters that often need to be tweaked. 
One such heuristic is the technique for emulating double- 
precision floating-point on x86 computers described in 
Section ||. 

To make numerical replication easier, the compile-time 
and run-time parameters that were used should be 
stored with a simulation's results, along with informa- 
tion such as the program and compiler versions, the 
date, the name of the machine being used, the platform 
and operating system, etc. In addition to tools for com- 
paring simulation results, we also need tools that make 
storing this kind of information easy and automatic. 
(Perl 1^ is an example of a program that stores a great 
deal of configuration information at compile-time; the 
information is accessible under Unix by running perl 
-V.) A researcher can then use this information when 
trying to numerically replicate the run. For example, if 
a simulation is run on an x86 computer using extended 



precision, it is important to record this fact. 

A future release of Drone ||^ will include tools for record- 
ing and comparing MD5 hashes of data files and for 
recording compile-time and run-time parameters of sim- 
ulation programs. I hope that making these tools avail- 
able will encourage researchers to use them every time 
they run a simulation. 

4 CONCLUSION 

In summary, these are problems that everyone doing 
computer simulations should be aware of, but they are 
not insurmountable. In practice, a few simple tech- 
niques should be sufficient to avoid problems. First, 
we should never assume that the results from two sim- 
ulation runs are identical because they used the same 
parameters and random number seed, even if they are 
run on the same computer. We should always verify 
this, either by comparing the relevant results directly or 
by comparing the MD5 hash strings of the two datasets. 
This verification process should be made so convenient 
that there is no reason not to do it. Secondly, we should 
compile a knowledgebase of likely parameters that can 
be tweaked to achieve numerical replication, if simply 
redoing a run with the same input and random seeds 
does not suffice. Finally, we should always store the 
compile-time and run-time parameters that we use. We 
need tools to make this convenient and automatic, as 
well. 

5 FURTHER READING 

For a technical discussion of this problem, see Priest 
and the Java Grande Forum Numerics Working Group's 
draft report ||l^. For a gentle introduction to floating- 
point arithmetic in general, see Patterson and Hen- 
nessy ||2^ or Goldber g Jill ; for a more technical dis- 
cussion, see Goldberg [|1C| or Knuth ||l|l. The IEEE 754 
floating-point standard is published in p3[ ; for a read- 
able account see Cody, et al. [||. Cody and Coonen 
give C algorithms to support features of the standard. 
Kahan and Darcy Q and Darcy argue that it is 
undesirable to enforce exact replicability across all com- 
puting platforms, and Kahan gives an example of 
differences in floating-point arithmetic in different ver- 
sions of Matlab on various platforms. Axtell, et al. [Q 
discuss the differing degrees to which a simulation can 
be replicated. See Rivest and Robshaw ||2^ for in- 
formation on the MD5 message digest algorithm; for a 
Perl interface to MD5, see Winton [g^- For information 
on RIPEMD-160, a more secure replacement for MD5, 
see Bosselaers Hi or Dobbertin M. 
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